In this project, we will look into Car Collision Report data, and analyze in detail what this data represents. We will make graphs, to visually analyze and find patterns, make predictions, and come to a better understanding on how we can reduce the amount of car collisions.
The data is obtained from dataMontgomery(https://data.montgomerycountymd.gov/Public-Safety/Crash-Reporting-Drivers-Data/mmzv-x632/data), a website that keeps records from the County of Montgomery in Maryland. The data has more than 67,000 entries, each being a driver. The data has been collected from January 1, 2015 and is updated weekly. It has 32 columns, mostly plain-text data about the circumstance involving the accidents, such as location, vehicle type, vehicle movement, type of road and date of accident. One thing to take into account is that some of the information is self-reported and has not been verified by a neutral party.
> #File locaiton on Dario's Machine
> dat = read.csv("/Users/dariomolina/DataScience/Crash-Report-Project/Crash_Reporting_-_Drivers_Data.csv",stringsAsFactors=FALSE)
> #Address on Evert's Machine
>
> #dat = read.csv("/Users/evert/Dropbox/School/Assignments/Crash_Reporting_-_Drivers_Data.csv",stringsAsFactors=FALSE)
Upon looking at the data, we realized that the columns Off.Road.Description, Related.Non.Motorist, and Non.Motorist.Substance.Abuse were all empty. Our first thought was that data was missing and we needed to replace the empty blanks with the value NA. But with further research we realized that the blanks in these columns represented a zero value. To put into perspective, an empty blank in OFF.Road.Description meant that there was no road description available. A blank in Related.Non.Motorist meant there was no non related motorist. With this knowledge in mind, the empty values in this columns to the integer value 0.
In total, there was about an 8% of missing data. The column Route.Type had 0.29351% in missing data. Road.Name was missing 0.28107%, Cross.Street.Type was missing 0.29365%, Cross.Street.Name was missing 0.28122%, Municipality was missing 2.81013%, Collision.Type was missing 0.00878%, Weather was missing 0.23114%, Surface.Condition was missing 0.36281%, Light was missing 0.02539%, Traffic.Control was missing 0.47553%, Driver.Substance.Abuse was missing 0.51511%, Non.Motorist.Substance.Abuse was missing 0.01827%, Circumstance was missing 2.51581%, Drivers.License.State was missing 0.12938%, Vehicle.Damage.Extent was missing 0.00465%, Vehicle.Body.Type was missing 0.04039%, and Vehicle.Movement was missing 0.00608% of data.
Most of the missing data was in the form of empty blanks or the string “N/A”. What we decided to do was convert all the empty blanks and the “N/A” into the value NA.
> x = c("Off.Road.Description", "Related.Non.Motorist", "Non.Motorist.Substance.Abuse")
>
> #assigning value 0 to "" in x vector
> for( i in x)
+ {
+ dat[[i]][dat[[i]] == ""] = 0
+ }
>
>
> #assignig rest values to NA
> for(i in names(dat))
+ {
+ dat[[i]][dat[[i]] == "" | dat[[i]] == "N/A"] = NA
+ }
>
>
>
> na_count <-sapply(dat, function(y) sum(length(which(is.na(y)))))
> na_count <- data.frame(sort(na_count,decreasing = TRUE))
> print(head(na_count,7))
sort.na_count..decreasing...TRUE.
Municipality 59207
Circumstance 53006
Driver.Substance.Abuse 10853
Traffic.Control 10019
Surface.Condition 7644
Cross.Street.Type 6187
Route.Type 6184
> dat_bi_daf = dat[dat$Related.Non.Motorist == "BICYCLIST" & dat$Driver.At.Fault == "Yes",]
>
> dat.sub1 <- subset(dat_bi_daf, Driver.Distracted.By == "LOOKED BUT DID NOT SEE")
> dat.sub2 <- subset(dat_bi_daf, Driver.Distracted.By == "NOT DISTRACTED")
>
> dat.sub <- rbind(dat.sub1,dat.sub2)
> dat.subwb = dat[!(dat$Report.Number %in% dat.sub$Report.Number),]
Before we started analyzing the data in detail, and since we are dealing with crash report type, we wanted to figure out which crash types were the most common.
> par(mar=c(3,14,2,2))
> x = table(dat$Collision.Type)
> barplot(x[x > 5000], horiz=TRUE, las=1, main=" ", col="firebrick")
From the graph, we realized that the most common type of car crash is same direction rear end. Which we assumed it would be either same direction rear end, or face to face collision like when two cars in opposite direction crash.
We then thought, since we know what are the most common types of car crash, what time during the year do they happen the most? In what month are car crashes most common? Do they rely on seasons, by festivities?
> #amount of crashes per month
> x =table(substr(dat$Crash.Date.Time,1,2))
> names(x) = c("Jan.","Feb.","Mar.","Apr.","May","Jun.","Jul.","Aug.","Sep.","Oct.","Nov.","Dec.")
> barplot(x, las = 1, col="firebrick")
From the graph we can see January is when most crashes happen. We had previously anticipated that perhaps most crashes would happen in the summer, since students are out from school, people go out on vacation, perhaps June and July would be the highest. But to our surprise January had the most. New Year resolution, new year, new car crash.
We were interested in figuring out, from the accidents reported, what were the most common types of report types. Because of the crashes, did many more fatalities happened than property damage? Which type of crash happend had the most different types of report types?
> y = table(dat$ACRS.Report.Type, dat$Collision.Type)
> x= c("STRAIGHT MOVEMENT ANGLE","SINGLE VEHICLE","SAME DIR REAR END","OTHER","HEAD ON LEFT TURN")
> y = y[,x]
>
> par(mar=c(3,2,3,2))
> plot(y, las=1,col="firebrick", main=" ")
After exploring the data further into detail, we were interested in finding out for the most common types of crashes, what were the most common types of reports produced. My partner and I thought fatality was going to have a bigger impact, but to our surprise it had the least impact.
Since we had analyzed the different most common types of car crash, and what time in the year they happen the most, we were curious to see under what kind of weather, what types of car crashes happend the most.
> z = table(dat$Weather,dat$Collision.Type)
> y= z[,c("STRAIGHT MOVEMENT ANGLE","SINGLE VEHICLE","SAME DIR REAR END","SAME DIRECTION RIGHT TURN","SAME DIRECTION LEFT TURN","HEAD ON LEFT TURN","OTHER")]
> y = y[c("CLEAR","CLOUDY","RAINING"),]
>
> par(mar=c(4,.5,4,.3))
> plot(y, main=" ", col="firebrick", las=1)
From the graph derived from the data, we were able to identify that same direction rear end, had the most types of crashes under clear weather. And clear weather overall was the most common type of weather were car crashes happened. Personally, my team mate and I thought rainy weather would be the most common type of weather where most crashes would happen since the rain can make tires have less grip on the road and be more prone to accidents.
We explored the data in greater detail for we wanted to figure if the surface condition when a car crash, correlated with the type of weather when a crash was reported. Since clear weather was most common, we would expect to have dry surface to be most common. We should expect a direct correlation between weather and surface condtion.
> #crashes by surface condition
> x = table(dat$Surface.Condition,dat$Collision.Type)
>
> cols = c("HEAD ON LEFT TURN","OTHER","SAME DIR REAR END","SAME DIRECTION SIDESWIPE","SINGLE VEHICLE","STRAIGHT MOVEMENT ANGLE")
> rows = c("DRY","WET","SNOW")
> z = x[rows,cols]
> plot(z, main=" ", las=1, col="firebrick")
Based on the graph, our expectations were met. And we also noted how same direction rear end still was the most common type of crash under the different types of surface condtion. We also wanted to verify that wet surface condtion had the same correlation to rainy weather.
We also got curious and interested in accidents involving bicycles. We decided to only look at the incidents in which the driver was at fault and not distracted because these situations are the ones where we can hopefully correct. This gave us almost 200 data points to analyze. We did some inital plots to try and find some pattern to these accidents.This first graph shows the total number of accidents that occured for each month.
> par(mar=c(2,9,2,2))
> x =table(substr(dat.sub$Crash.Date.Time,1,2))
> names(x) = c("Jan.","Feb.","Mar.","Apr.","May","Jun.","Jul.","Aug.","Sep.","Oct.","Nov.","Dec.")
> barplot(x, las = 1, col="firebrick", main="Num. of Accidents Per Month")
Not too surprisingly, the amount of crashes increases throughout the year as the weather is more suitable for riding bicyles, and dips down during the winter season.
After the frist graph, we wondered if the number crashes would also increase during times of heavy traffic during a day, such as when people are commuting to or from work/school.
> #formating the Crash.Date.Time Column
> dat.sub$Crash.Date.Time <- as.POSIXlt(as.character(dat.sub$Crash.Date.Time), format = "%m/%d/%Y %I:%M:%S %p")
> par(mar=c(4,2.5,2,2))
> y = table(substr(dat.sub$Crash.Date.Time,12,13))
> names(y) = c("12am","1am","5am","6am","7am","8am","9am","10am","11am","12pm","1pm","2pm","3pm","4pm","5pm","6pm","7pm","8pm","9pm","10pm","11pm")
> barplot(y,las=1,col = "firebrick",main="Total Numb. of Accidents Per Each Hour", xlab = "Hour")
The graph confirmed our inital suspisions by showing that the number of crashes spiked during common comute times.
Next, we decided to see if we could gain any insights in regards to where these accidents happen so we plotted them onto a map of the county. Once again, we compared the map of bicycle related accidents to the rest of the data.
> library(ggmap)
>
> par(mfrow=c(2,2))
> map = get_map(c(-77.08025,39.07991),maptype = "road",zoom=11)
> p = ggmap(map)
> p = p + geom_point(data=dat.sub, aes(x=Longitude,y=Latitude),color="firebrick",size=3)
> print(p)
> map = get_map(c(-77.08025,39.07991),maptype = "road",zoom=13)
> j = ggmap(map)
> j = j + geom_point(data=dat.subwb, aes(x=Longitude,y=Latitude),color="firebrick",size=3)
> print(j)
Although not too visible from the small graph, it seemed to indicate that a significant amount of the incidents occurred at or near an intersection.
Based on the results of the previous map, we decided to graph the movement of the vehicle. To better get an insight into this, it was compared to the movement of the vehicle in the overall dataset.
> par(mfrow=c(2,1))
> par(mar=c(2,13,2,2))
> barplot(head(sort(table(dat.sub$Vehicle.Movement),decreasing = TRUE)),col="firebrick",las=1,horiz = TRUE,main="Vehicle Movement w/ Bicycle Involvment")
>
> par(mar=c(2,13,2,2))
> barplot(head(sort(table(dat.subwb$Vehicle.Movement),decreasing = TRUE)),col="firebrick",las=1,horiz = TRUE,main="Vehicle Movement")
From the plot, we learned that at least about 1/3 of the incidents involved the vehicle making some sort of turn when a bicylce is involved. This is in contrast to the rest of dataset in which only one of the top six vehicle movements involves the vehicle during a turn.
Finally, we decided to generalize the vehicle movement category into four categories in order to simplify maping the points. The turning sub-category indicates incidents in which the vehicle was making any type of turn (left,right,U,etc.), the accelerating sub-category shows incidents where the vehicle was either changing speed (slowing dow/speeding up) or changing lanes (including merging into traffic).
> turning=c("MAKING LEFT TURN", "MAKING RIGHT TURN","MAKING U TURN","RIGHT TURN ON RED")
>
> acc = c("ACCELERATING","ENTERING TRAFFIC LANE","STARTING FROM PARKED","STARTING FROM PARKED","STARTING FROM LANE", "SLOWING OR STOPPING","CHANGING LANES")
>
> stat = c("PARKED","STOPPED IN TRAFFIC LANE")
>
> other = c("BACKING","SKIDDING", "PASSING" , "MOVING CONSTANT SPEED")
>
> dat.sub$Vehicle.Movement[dat.sub$Vehicle.Movement %in% turning] <- "TURNING"
> dat.sub$Vehicle.Movement[dat.sub$Vehicle.Movement %in% acc] <- "ACCELERATING"
> dat.sub$Vehicle.Movement[dat.sub$Vehicle.Movement %in% other] <- "OTHER"
> dat.sub$Vehicle.Movement[dat.sub$Vehicle.Movement %in% stat] <- "STATIONARY"
>
>
> map = get_map(c(-77.08025,39.07991),maptype = "road",zoom=11)
> p = ggmap(map)
> p = p + geom_point(data=dat.sub, aes(x=Longitude,y=Latitude, color = Vehicle.Movement),size=3)
>
> print(p)
From this last map, we observed that accidents where the vehicle was turning seem to be a bit more prone to clustering up, at least more than the other categories. This might indicate that there are certain locations or intersections in the county where it might be difficult to see a bicyclist, or a road where there is no designated bicycle lane.
In conclusion, we have analyzed different scenarios of car collisions obtained from the data of Montgomery County, Maryland. From the data, we learned that the most common and frequent type of collision is same direction rear end. Many of the collision happen during the month of January, the most common type of report type is a property damage crash, and most crashes happen during clear weather.
We also analyzed car callision that involved bicycle accidents. From the data, we found out that most bicycle related accidents happen during July, as the weather is most favorable, and that most bycicle incidents happen during 5pm. Interestingly enough, we found out that most bicycle incidents occurred at or near an intersection. We hope with the insights from this report we can provide insight to reader and reduce the amount of car collisions and bicycle related accidents.